DSK: k-mer counting with very low memory usage.
Identifieur interne : 000132 ( France/Analysis ); précédent : 000131; suivant : 000133DSK: k-mer counting with very low memory usage.
Auteurs : Guillaume Rizk [France] ; Dominique Lavenier ; Rayan ChikhiSource :
- Bioinformatics (Oxford, England) [ 1367-4811 ] ; 2013.
Descripteurs français
- KwdFr :
- MESH :
English descriptors
- KwdEn :
- MESH :
- methods : Sequence Analysis, DNA, Sequence Analysis, RNA.
- Algorithms, Genome, Human, Humans, Software.
Abstract
Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers.
DOI: 10.1093/bioinformatics/btt020
PubMed: 23325618
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream PubMed, to step Corpus: 001D24
- to stream PubMed, to step Curation: 001D24
- to stream PubMed, to step Checkpoint: 001B72
- to stream Ncbi, to step Merge: 000A17
- to stream Ncbi, to step Curation: 000A17
- to stream Ncbi, to step Checkpoint: 000A17
- to stream Main, to step Merge: 002043
- to stream Main, to step Curation: 002022
- to stream Main, to step Exploration: 002022
- to stream France, to step Extraction: 000132
Links to Exploration step
pubmed:23325618Le document en format XML
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">DSK: k-mer counting with very low memory usage.</title>
<author><name sortKey="Rizk, Guillaume" sort="Rizk, Guillaume" uniqKey="Rizk G" first="Guillaume" last="Rizk">Guillaume Rizk</name>
<affiliation wicri:level="3"><nlm:affiliation>Algorizk, 75013 Paris, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>Algorizk, 75013 Paris</wicri:regionArea>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Lavenier, Dominique" sort="Lavenier, Dominique" uniqKey="Lavenier D" first="Dominique" last="Lavenier">Dominique Lavenier</name>
</author>
<author><name sortKey="Chikhi, Rayan" sort="Chikhi, Rayan" uniqKey="Chikhi R" first="Rayan" last="Chikhi">Rayan Chikhi</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PubMed</idno>
<date when="2013">2013</date>
<idno type="RBID">pubmed:23325618</idno>
<idno type="pmid">23325618</idno>
<idno type="doi">10.1093/bioinformatics/btt020</idno>
<idno type="wicri:Area/PubMed/Corpus">001D24</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001D24</idno>
<idno type="wicri:Area/PubMed/Curation">001D24</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">001D24</idno>
<idno type="wicri:Area/PubMed/Checkpoint">001B72</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">001B72</idno>
<idno type="wicri:Area/Ncbi/Merge">000A17</idno>
<idno type="wicri:Area/Ncbi/Curation">000A17</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000A17</idno>
<idno type="wicri:Area/Main/Merge">002043</idno>
<idno type="wicri:Area/Main/Curation">002022</idno>
<idno type="wicri:Area/Main/Exploration">002022</idno>
<idno type="wicri:Area/France/Extraction">000132</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">DSK: k-mer counting with very low memory usage.</title>
<author><name sortKey="Rizk, Guillaume" sort="Rizk, Guillaume" uniqKey="Rizk G" first="Guillaume" last="Rizk">Guillaume Rizk</name>
<affiliation wicri:level="3"><nlm:affiliation>Algorizk, 75013 Paris, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>Algorizk, 75013 Paris</wicri:regionArea>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Lavenier, Dominique" sort="Lavenier, Dominique" uniqKey="Lavenier D" first="Dominique" last="Lavenier">Dominique Lavenier</name>
</author>
<author><name sortKey="Chikhi, Rayan" sort="Chikhi, Rayan" uniqKey="Chikhi R" first="Rayan" last="Chikhi">Rayan Chikhi</name>
</author>
</analytic>
<series><title level="j">Bioinformatics (Oxford, England)</title>
<idno type="eISSN">1367-4811</idno>
<imprint><date when="2013" type="published">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Algorithms</term>
<term>Genome, Human</term>
<term>Humans</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Sequence Analysis, RNA (methods)</term>
<term>Software</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr"><term>Algorithmes</term>
<term>Analyse de séquence d'ADN ()</term>
<term>Analyse de séquence d'ARN ()</term>
<term>Génome humain</term>
<term>Humains</term>
<term>Logiciel</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en"><term>Sequence Analysis, DNA</term>
<term>Sequence Analysis, RNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en"><term>Algorithms</term>
<term>Genome, Human</term>
<term>Humans</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr"><term>Algorithmes</term>
<term>Analyse de séquence d'ADN</term>
<term>Analyse de séquence d'ARN</term>
<term>Génome humain</term>
<term>Humains</term>
<term>Logiciel</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
<region><li>Île-de-France</li>
</region>
<settlement><li>Paris</li>
</settlement>
</list>
<tree><noCountry><name sortKey="Chikhi, Rayan" sort="Chikhi, Rayan" uniqKey="Chikhi R" first="Rayan" last="Chikhi">Rayan Chikhi</name>
<name sortKey="Lavenier, Dominique" sort="Lavenier, Dominique" uniqKey="Lavenier D" first="Dominique" last="Lavenier">Dominique Lavenier</name>
</noCountry>
<country name="France"><region name="Île-de-France"><name sortKey="Rizk, Guillaume" sort="Rizk, Guillaume" uniqKey="Rizk G" first="Guillaume" last="Rizk">Guillaume Rizk</name>
</region>
</country>
</tree>
</affiliations>
</record>
Pour manipuler ce document sous Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/France/Analysis
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000132 | SxmlIndent | more
Ou
HfdSelect -h $EXPLOR_AREA/Data/France/Analysis/biblio.hfd -nk 000132 | SxmlIndent | more
Pour mettre un lien sur cette page dans le réseau Wicri
{{Explor lien |wiki= Sante |area= MersV1 |flux= France |étape= Analysis |type= RBID |clé= pubmed:23325618 |texte= DSK: k-mer counting with very low memory usage. }}
Pour générer des pages wiki
HfdIndexSelect -h $EXPLOR_AREA/Data/France/Analysis/RBID.i -Sk "pubmed:23325618" \ | HfdSelect -Kh $EXPLOR_AREA/Data/France/Analysis/biblio.hfd \ | NlmPubMed2Wicri -a MersV1
This area was generated with Dilib version V0.6.33. |